With Applications in R
University of Utah, USA
2024-08-23
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn’t fit your computer
Parallel computing: How to take advantage of multiple core systems
Compiled code: Write your own low-level code (if R doesn't have it yet…)
(Check out the CRAN Task View on HPC)
High Throughput Computing Cluster (HTC Cluster)
Supercomputer (HPC Cluster)
Single Instruction, Multiple Data (SIMD)
In terms of scale
HTC > HPC > Single node > Socket > Core > Thread | SIMD vectorization
How many cores does your computer have?
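In R you can check with `parallel::detectCores()`:

```r
# Number of (logical) cores the OS reports; the value is machine-dependent
parallel::detectCores()
```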
Ask yourself these questions before jumping into HPC!
Things that are easily parallelizable: independent repetitions of the same task, e.g. bootstrapping, Monte Carlo simulation, cross-validation.
Things that are not easily parallelizable: inherently sequential computations, e.g. the iterations of a single MCMC chain, where each step depends on the previous one.
Parallelization is not free: most of the cost is in sending and receiving data.
In R (as in other languages), you can mitigate this by (i) reducing the amount of data communicated, and (ii) reducing the number of times you communicate.
Overhead cost of parallelization: Fitting \(y = \alpha + \beta_k X_k + \varepsilon,\quad k = 1, \dots\) (more about this later)
While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R-packages for explicit parallelism
Some examples:
Implicit parallelism, on the other hand, refers to out-of-the-box tools that let the programmer avoid worrying about parallelization, e.g. gpuR for matrix manipulation on the GPU, or tensorflow.
And there’s also a more advanced set of options
There are several types of clusters: PSOCK, Fork, MPI, etc.

(Usually) We do the following:

1. Create a PSOCK/FORK (or other) cluster using `makePSOCKcluster`/`makeForkCluster` (or `makeCluster`).
2. Copy/prepare each R session (if you are using a PSOCK cluster):
   - Copy objects with `clusterExport`.
   - Pass expressions with `clusterEvalQ`.
   - Set a seed.
3. Do your call: `parApply`, `parLapply`, etc.
4. Stop the cluster with `stopCluster`.

| Type | Description | Pros | Cons |
|---|---|---|---|
| PSOCK | Multiple machines via socket connection | Works in all OSs | Slowest |
| FORK | Single machine via forking | Avoids memory duplication | Only for Unix-based systems |
| MPI | Multiple machines via Message Passing Interface | Best alternative for HPC clusters | Sometimes hard to set up |
Using PSOCK, the slurmR package creates clusters spanning multiple nodes in HPC environments (think hundreds of cores).
# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(4)
x <- 20
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`
clusterExport(cl, "x")
# 3. DO YOUR CALL
clusterEvalQ(cl, {
paste0("Hello from process #", Sys.getpid(), ". x = ", x)
})
[[1]]
[1] "Hello from process #52321. x = 20"
[[2]]
[1] "Hello from process #52320. x = 20"
[[3]]
[1] "Hello from process #52323. x = 20"
[[4]]
[1] "Hello from process #52322. x = 20"
Problem: Run multiple regressions on a very wide dataset. We need to fit the following model:
\[ y = X_i\beta_i + \varepsilon,\quad \varepsilon\sim N(0, \sigma^2_i),\quad\forall i \]
[1] 500 999
x001 x002 x003 x004 x005
1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281
2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221
3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675
4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953
5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415
6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764
num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ...
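The code that generated `X` and `y` is not shown in the excerpt; judging from the output above (a 500 × 999 matrix with columns `x001`, `x002`, … and a numeric response of length 500), data of this shape could be simulated as follows (the seed and distributions are assumptions):

```r
set.seed(132)  # hypothetical seed; the original is not shown
n <- 500
p <- 999
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, sprintf("x%03d", 1:p)))
y <- rnorm(n)
dim(X)  # 500 999
```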
cl <- parallel::makePSOCKcluster(4L)
ans <- parallel::parApply(
cl = cl,
X = X,
MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
ans[,1:5]
                   x001        x002        x003        x004        x005
(Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344
x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826
Are we going any faster?
library(microbenchmark)
microbenchmark(
parallel = parallel::parApply(
cl = cl,
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
serial = apply(
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
times = 10,
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
parallel 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
serial 2.639255 2.644114 2.491739 2.590543 2.496979 2.070579 10 b
Problem: We want to bootstrap a logistic regression model. We need to fit the following model:
\[ P(Y=1) = \text{logit}^{-1}\left(X\beta\right) \]
[1] 100 5
[,1] [,2] [,3] [,4] [,5]
[1,] -0.13592452 1.1921489 -1.04101654 0.26500638 -0.51561099
[2,] -0.04079697 -0.1231379 -0.43190705 1.38694989 0.39568325
[3,] 1.01053901 -0.5741648 -0.77781632 -0.29149014 -0.78301461
[4,] -0.15826244 -1.4903169 0.37368178 -1.83027672 0.88538861
[5,] -2.15663750 2.3638289 0.31256458 -1.62766978 -0.38212891
[6,] 0.49864683 -2.9510362 0.07122864 -0.01630346 0.05333596
[1] 1 1 0 0 1 0
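The code generating this data is likewise not shown; given the dimensions above (a 100 × 5 matrix and a binary response), it could be simulated as follows (the seed and coefficients are made up):

```r
set.seed(123)      # hypothetical seed; the original is not shown
n <- 100
k <- 5
X <- matrix(rnorm(n * k), ncol = k)
beta <- rnorm(k)   # made-up true coefficients
y <- as.integer(runif(n) < plogis(X %*% beta))
dim(X)  # 100 5
```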
my_boot <- function(y, X, B=1000) {
# Generating the indices
n <- length(y)
indices <- sample.int(n = n, size = n * B, replace = TRUE) |>
matrix(nrow = n)
# Fitting the model
apply(indices, 2, function(i) {
glm(y[i] ~ X[i,], family = binomial("logit")) |>
coef()
}) |> t()
}
set.seed(3312)
ans <- my_boot(y, X, B=50)
head(ans)
     (Intercept)     X[i, ]1  X[i, ]2   X[i, ]3  X[i, ]4    X[i, ]5
[1,] 2.943576 -0.46986617 2.292807 1.1069735 2.117947 0.7839228
[2,] 4.265760 -0.01445575 3.881603 2.5052960 4.300462 0.0542386
[3,] 2.702185 -0.40973910 2.315127 1.1693082 3.059388 0.2927383
[4,] 4.827939 -1.52854114 2.692226 1.7977035 4.370736 0.7825011
[5,] 3.229396 -0.56316370 1.980704 1.4054200 3.949632 0.2806117
[6,] 2.933971 0.25911455 2.193838 0.6953409 1.970649 -0.3528708
my_boot_pll <- function(y, X, cl, B=1000) {
# Generating the indices
n <- length(y)
indices <- sample.int(n = n, size = n * B, replace = TRUE) |>
matrix(nrow = n)
# Making sure y and X (the function arguments) are available in the cluster
parallel::clusterExport(cl, c("y", "X"), envir = environment())
# Fitting the model
parallel::parApply(cl, indices, 2, function(i) {
glm(y[i] ~ X[i,], family = binomial("logit")) |>
coef()
}) |> t()
}
cl <- parallel::makeForkCluster(4)
set.seed(3312)
ans_pll <- my_boot_pll(y, X, cl, B=50)
head(ans_pll)
     (Intercept)     X[i, ]1  X[i, ]2   X[i, ]3  X[i, ]4    X[i, ]5
[1,] 2.943576 -0.46986617 2.292807 1.1069735 2.117947 0.7839228
[2,] 4.265760 -0.01445575 3.881603 2.5052960 4.300462 0.0542386
[3,] 2.702185 -0.40973910 2.315127 1.1693082 3.059388 0.2927383
[4,] 4.827939 -1.52854114 2.692226 1.7977035 4.370736 0.7825011
[5,] 3.229396 -0.56316370 1.980704 1.4054200 3.949632 0.2806117
[6,] 2.933971 0.25911455 2.193838 0.6953409 1.970649 -0.3528708
How much faster?
microbenchmark::microbenchmark(
parallel = my_boot_pll(y, X, cl, B=1000),
serial = my_boot(y, X, B=1000),
times = 1,
unit = "s"
)
Unit: seconds
expr min lq mean median uq max neval
parallel 0.1942538 0.1942538 0.1942538 0.1942538 0.1942538 0.1942538 1
serial 0.5178959 0.5178959 0.5178959 0.5178959 0.5178959 0.5178959 1
Problem: Revisit of the overhead cost of parallelization. We want to fit the following model \[y = X_k\beta_k + \varepsilon,\quad k = 1, \dots\]
[,1] [,2] [,3] [,4] [,5]
[1,] -1.91835255 -0.2359106 -1.4642601 -0.5320349 -0.4639574
[2,] -0.09017806 -0.1022420 -0.6735899 1.6146947 -2.3792154
[3,] -1.25551672 1.2079800 0.2159515 -0.1323614 0.9867689
[4,] 1.28006769 -0.2806277 -0.2026345 -0.7375033 -0.1067501
[1] 0.3570720 0.4850507 1.0281664 -0.7044579 1.1378356 -0.9032009
Let’s start with the naive approach: fitting the model and returning the full output.
|  | Elapsed time (s) |
|---|---|
| Serial | 2.715 |
| Parallel naive | 4.834 |
The problem: we are returning a lot of information that we may not need.
Instead of capturing the full output, we can just return the coefficients.
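The timing code itself is not included in the excerpt; a self-contained sketch of the two strategies, using small made-up data, might look like this:

```r
library(parallel)

set.seed(1)                               # made-up data for illustration
X  <- matrix(rnorm(500 * 50), nrow = 500)
y  <- rnorm(500)
cl <- makePSOCKcluster(2L)

# Naive: each worker ships a full lm object back per column
ans_naive <- parApply(cl, X, 2, function(x, y) lm(y ~ x), y = y)

# Leaner: send back only the two coefficients per fit
ans_coef <- parApply(cl, X, 2, function(x, y) coef(lm(y ~ x)), y = y)

stopCluster(cl)
dim(ans_coef)  # 2 50
```

Comparing `object.size(ans_naive)` against `object.size(ans_coef)` makes the communication saving explicit.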
|  | Elapsed time (s) |
|---|---|
| Serial | 2.715 |
| Parallel naive | 4.834 |
| Parallel coef | 0.690 |
The coefficients are much smaller, significantly reducing the overhead cost to about 0.8 MB.
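A quick way to see the size difference for a single fit, using the built-in `mtcars` data:

```r
fit <- lm(mpg ~ hp, data = mtcars)
print(object.size(fit))        # the full model object: several kilobytes
print(object.size(coef(fit)))  # just two named numbers: a few hundred bytes
```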
Pro-tip
Using a Fork cluster instead of a PSOCK cluster can further reduce the overhead cost: both X and y would have been automatically available to the forked processes at no extra cost.
gvegayon
ggvy.cl
george.vegayon@utah.edu
For more, check out the CRAN Task View on HPC
We know that \(\pi = \frac{A}{r^2}\), where \(A\) is the area of a circle of radius \(r\). We approximate it by randomly drawing points \(x\) in a square of side 2 centered at the origin.
So, we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\)
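The simulation kernel `pisim` used in the benchmark below is not defined in the excerpt; a minimal version consistent with the description could be:

```r
# Approximate pi: draw nsim points uniformly on [-1, 1] x [-1, 1] and
# take 4 * (share of points falling inside the unit circle).
# The unused first argument lets it plug into sapply/parSapply.
pisim <- function(i, nsim) {
  p <- matrix(runif(nsim * 2, -1, 1), ncol = 2)
  4 * mean(sqrt(rowSums(p^2)) <= 1)
}
```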
The R code to do this
library(parallel)
# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)
# Number of simulations we want each time to run
nsim <- 1e5
# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))
# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
microbenchmark::microbenchmark(
parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
serial = sapply(1:100, pisim, nsim=nsim),
times = 10,
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
parallel 1.000000 1.0000 1.00000 1.000000 1.000000 1.000000 10 a
serial 1.867702 1.8882 1.98731 1.907443 2.290021 2.120476 10 b
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin23.4.0
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /opt/homebrew/Cellar/openblas/0.3.27/lib/libopenblasp-r0.3.27.dylib
LAPACK: /opt/homebrew/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Denver
tzcode source: internal
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] microbenchmark_1.4.10
loaded via a namespace (and not attached):
[1] digest_0.6.36 codetools_0.2-20 multcomp_1.4-25 fastmap_1.2.0
[5] Matrix_1.7-0 xfun_0.45 lattice_0.22-6 TH.data_1.1-2
[9] splines_4.4.1 zoo_1.8-12 knitr_1.47 htmltools_0.5.8.1
[13] rmarkdown_2.27 mvtnorm_1.2-5 cli_3.6.3 grid_4.4.1
[17] sandwich_3.1-0 compiler_4.4.1 tools_4.4.1 evaluate_0.24.0
[21] survival_3.6-4 yaml_2.3.8 rlang_1.1.4 jsonlite_1.8.8
[25] MASS_7.3-60.2